Cross-lingual Similarity Calculation for Plagiarism Detection and More - Tools and Resources

Author

  • Ralf Steinberger
Abstract

Agenda
  • EC-Joint Research Centre (JRC) – Who we are
  • Monolingual plagiarism detection (PD) work at the JRC
  • Cross-lingual similarity calculation at the JRC
  • Named entity (NE) matching across languages
  • Linking related news items across languages
  • Identifying translations of documents
  • JRC's multilingual tools and resources
  • Summary

JRC – Who we are
  • European Commission (scientific-technical arm of public administration)
  • Non-commercial
  • Multidisciplinary / multilingual
  • Main product: Europe Media Monitor (EMM)
  • ~ 150,000 online news articles / day in ~ 50 languages
  • ~ 3600 sources (worldwide, with focus on Europe)
  • In-depth analysis in 20 languages (NewsExplorer)
  • 24/7, updated every 10 minutes
  • Freely accessible via

Monolingual PD work
  • N-gram overlap between pairs of documents
  • Karp-Rabin algorithm, using word 5-grams
  • to weed out duplicates in the IAEA document database (ca. 350K documents)
  • to find news article near-duplicates in EMM (applied to all news clusters)
  • Method: Search for longest (in chars) word 6-grams of each document in EC database and on the web (avoiding strings from document template)
  • If target documents pass similarity threshold:
  • Full-text comparison of matching documents to detect significant matches
  • Visualise document overlap and manually check.
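The n-gram overlap idea from the monolingual PD slide can be sketched as follows. This is a minimal illustration, not the JRC implementation: the tokenizer, the hash constants `BASE` and `MOD`, the per-word IDs, and the containment-style score are all assumptions made for the example. It hashes every word 5-gram of a document with a Karp-Rabin style rolling hash and compares the resulting hash sets of two documents.

```python
import re
import zlib

# Hash parameters are illustrative assumptions, not values from the slides.
BASE = 1_000_003
MOD = (1 << 61) - 1


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_id(word):
    """Map a word to a stable positive integer (CRC32 is an arbitrary choice)."""
    return zlib.crc32(word.encode("utf-8")) + 1


def rolling_ngram_hashes(words, n=5):
    """Return the set of rolling hashes of all word n-grams in `words`."""
    if len(words) < n:
        return set()
    ids = [word_id(w) for w in words]
    high = pow(BASE, n - 1, MOD)  # weight of the word leaving the window
    h = 0
    for x in ids[:n]:
        h = (h * BASE + x) % MOD
    hashes = {h}
    for i in range(n, len(ids)):
        h = (h - ids[i - n] * high) % MOD  # drop the oldest word
        h = (h * BASE + ids[i]) % MOD      # append the newest word
        hashes.add(h)
    return hashes


def ngram_overlap(doc_a, doc_b, n=5):
    """Share of n-gram hashes of the smaller set that also occur in the other."""
    ha = rolling_ngram_hashes(tokenize(doc_a), n)
    hb = rolling_ngram_hashes(tokenize(doc_b), n)
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / min(len(ha), len(hb))
```

Pairs whose score passes a chosen threshold would then go on to the full-text comparison and manual check described in the slide; the threshold itself is not given in the source.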
Multilingual NER: Merging name variants
  • 20% + 80%

Condition:
  • For all newly found name forms, detect whether they are a variant of an existing NE:
  • Transliteration;
  • Normalisation, using ~30 handwritten rules and removing vowels;
  • Calculate similarity (threshold: 94%).
  • Below threshold → new entity
  • For frequent or highly visible names, manually launch a Wikipedia mining process.
  • Check for each variant of a name whether there is a Wikipedia entry.
  • New name variants, in all scripts, will be recognised in new EMM …
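The variant-merging steps listed above (normalise, remove vowels, compare against a 94% threshold) might look roughly like this. It is only a sketch under stated assumptions: the three rewrite rules in `RULES` are hypothetical stand-ins for the ~30 handwritten rules, transliteration between scripts is left out, and `difflib.SequenceMatcher` is a substitute similarity measure, since the source does not say which one is used.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical normalisation rules; the real ~30 rules are not given in the source.
RULES = [
    (r"ph", "f"),       # "Stephan" and "Stefan" start to converge
    (r"ck", "k"),
    (r"(.)\1", r"\1"),  # collapse doubled letters
]

THRESHOLD = 0.94  # similarity threshold quoted in the slides


def normalise(name):
    """Lowercase, strip diacritics, apply the rewrite rules, remove vowels."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return re.sub(r"[aeiou]", "", text)


def similarity(name_a, name_b):
    """Similarity of the normalised forms (SequenceMatcher is an assumed stand-in)."""
    return SequenceMatcher(None, normalise(name_a), normalise(name_b)).ratio()


def is_variant(new_name, known_name):
    """True if the new name form should be merged with the known entity."""
    return similarity(new_name, known_name) >= THRESHOLD
```

For example, `is_variant("Muammar Gaddafi", "Muamar Gadafi")` holds under these rules because both forms normalise to the same consonant skeleton, while a name form that stays below the threshold would be registered as a new entity.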


Similar resources

English-Persian Plagiarism Detection based on a Semantic Approach

Plagiarism which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them” poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories of direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism which is sometimes referred to as cross-li...


A resource-light method for cross-lingual semantic textual similarity

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named ...


Providing Cross-Lingual Information Access with Knowledge-Poor Methods

We are proposing a simple, but efficient approach for a number of multilingual and cross-lingual language technology applications that are not limited to the usual two or three languages, but that can be applied with relatively little effort to larger sets of languages. The approach consists of using existing multilingual linguistic resources such as thesauri, nomenclatures and gazetteers, as w...


Detection of Paraphrastic Cases of Mono-lingual and Cross-lingual Plagiarism

External plagiarism detection is a unique retrieval process where the algorithm has to provide an evidence of plagiarism if any for a suspicious section from the pool of source documents available. This paper focuses on paraphrasing involved in detection of plagiarism both from monolingual and cross-lingual aspect. In order to investigate the challenges in detection, we further analyse the perf...


On Cross-lingual Plagiarism Analysis using a Statistical Model

The automatic detection of plagiarism is a task that has acquired relevance in the Information Retrieval area and it becomes more complex when the plagiarism is made in a multilingual panorama, where the original and suspicious texts are written in different languages. From a cross-lingual perspective, a text fragment in one language is considered a plagiarism of a text in another language if t...




Publication date: 2012